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Abstract We discuss the performance characteristics of using the modification of the tree 
code suggested by Barnes (Barnes, 1990) in the context of the TreePM code. The optimisa- 
tion involves identifying groups of particles and using only one tree walk to compute force 
for all the particles in the group. This modification has been in use in our implementation of 
the TreePM code for some time, and has also been used by others in codes that make use of 
tree structures. In this paper, we present the first detailed study of the performance charac- 
teristics of this optimisation. We show that the modification, if tuned properly can speed up 
the TreePM code by a significant amount. We also combine this modification with the use of 
individual time steps and indicate how to combine these two schemes in an optimal fashion. 
We find that the combination is at least a factor of two faster than the modified TreePM with- 
out individual time steps. Overall performance is often faster by a larger factor, as the scheme 
of groups optimises use of cache for large simulations. 
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1 INTRODUCTION 

Large scale structures traced by galaxies are believed t o have formed by a r nplific ation of small perturbations 
(iPeeblesl Il980t iPeacockl Il999t jPadmanabhani l2002t iBernardeau et all |2002|) . Galaxies are highly over- 
dense systems, matter density p in galaxies is thousands of times larger than the average density p in the 
universe. Typical density contrast {5 = p/ p — 1) in matter at these scales in the early universe was much 
smaller than unity. Thus the problem of galaxy formation and the large scale distribution of galaxies requires 
an understanding of the evolution of density perturbations from small initial values to the large values we 
encounter today. 

Initial density p erturbations were present at all scales that have been observed (ISpergel et al.l l2007l 
iPercival et al.l l2007h . The equations that describe th e evolution of density perturbations in an expand- 
ing universe have been known for several decades (iPeeblesL 1 1974 and these are easy to solve when 
the amplitude of perturbations is small. Once density contrast at relevant scales becomes compara- 
ble to unity, perturbations becomes non-linear and coupling with perturbations at other scales can- 
not be ignored. The equation for evolution of density perturb ations cannot be solved for generic 
initial conditions in this regime. N-Body simulatio ns (e.g., see (lEfstathiou et al.l 1 19851 : iBertschingerl 
1 19981 iBagla & Padmanabhanl 1 19971: iBaglal |2005|) ) are often used to study the evolution in this 



and in such a case either quasi-linear approximation schemes ( Bernardeau et al.L 2002b, (Zel'Dovich 


19701 'Gurbatov, Saichev, & Shandarin', '1989"; 'MataiTcse et al.', '1992^, 'Brainerd, Scherrer, & Villumsen 


1993; Bagla & Padmanabhaa 1994; Sahni & Coles, 1995; Hui & Bertschinger, 1996 


) or scaling rela- 


tions (llDavis & Peebles, 1977: Hamilton et al., 199ll|jain, Mo, & White, 19951 Kanekai 


20001 Mai 1998 


Nitvananda & Padmanabhanl 1994: Padmanabhan et al. , 


1996£ iPeacock & Doddsl 1994 Padmanabhan 


1996; Peacock & Dodds, 


19961 Smith et al., 2003) suffice. However, even the approximation schemes and 



scaling relations must be compared with simulations before these can be used with confidence. 
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Last three decades have seen a rapid development of techniques and computing power for cosmo- 
logical simulations and the results of these simulations have provided valuable insight into the study 
of structure formation. The state of the art simulations used less than 10^ particles two decades ago 
(lEfstathiou et al.l Il988b and if the improvement had been due only to computing power then the largest 
simulation possible today should hav e been around 10^ pa rticles, whereas the largest simulations done till 
date used more than 10^" particles dSpringel et al.L l2005b . Evidently, development of n ew methods and 



optim is ations has also played a significant role in the evolution of simulation studies (lEfstathiou et al 



1985 [) . (iBarnes & HuA [l^^ iGreengard & RokhliU fl987l: iBouchet & HernauisA|l988l:IJernigan & Portei 
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Dubinskil 1 199a iKravtso v, Klvpin, & Kho khlovt 1 19971; iMacfarland et al.r il998t iBode. Ostriker. & Xu , 



20001; iBrieu & Evrardl bOOQ; iDehnem i2000t iKnebe Green & Binnevi |200U ISpringel. Yoshida. & White . 
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Thacker & Couchman , 2006h Along the way, code developers have also successfully met the challange 
posed by the emergence of distributed parallel programming. 

In this paper, we disc uss the performance characteristics of an optimisati on for tree codes suggested 
by Ba rnes fearnesl Il990l) . We do this in the context of the TreePM method dBaglal 120021; iBagla & Ravi 
l2003h where the tree method is used for computing the short range force. The TreePM method brings in an 
additional scale into the problem, i.e., the scale upto which the short range force is computed and this leads 
to non-trivial variations in error in force. 

The paper is organised as follows: we introduce the TreePM method in §2, and discuss the optimisation 
scheme in §3. Performance of the optimisation scheme is discussed in §4, and we discuss combining this 
with individual time steps for particles in §5. We end with a discussion in §6. 

2 THE TREEPM ALGORITHM 



The TreePM algorithm dBagla, l2002t iBagla & Ra\{ l2003b is a hybrid N-Bo dy method which com- 
bines the BH-Tree method (Barnes & Hut", 'l986|_^with the PM meth od (B agla & Padmanabhanl 
19971 iMerz. Pen. & Trad. l2005nKlvpin & Shandariiii Il983l; iMiUeii Il983l iBourhet & Kandrup. ,19851: 
Bouchet. Adam. & Pellatt, Il985l iHocknev & Eastwood! [l 9881) . The TreePM method exphcitly breaks the 
potential into a short-range and a longe-range component at a scale r^. The PM method is used to calculate 
long-range force and the short-range force is computed using the BH Tree method. Use of the BH Tree for 
short-range force calculation enhances the force resolution as compared to the PM method. 

The gravitational force is divided into a long range and a short range part using partitioning of unity in 
the Poisson equation. 
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Here and (/jjT are the short-range and long-range potentials in Fourier space, p is the density, G is the 
gravitational coupling constant and rg is the scale at which the splitting of the potential is done. The long- 
range force is solved in Fourier space with the PM method and the short-range force is solved in real space 
with the Tree method. The short range force in real space is: 
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Here erfc is the complementary error function. 

The short range force is below 1% of the total force at r > brg. The short range force is therefore 
computed within a sphere of radius rcut — Sr^. The short range force is computed using the BH tree 
method. The tree structure is built out of cells and particles. Cells may contain smaller cells (subcells) 
within them. Subcells can have even smaller cells within them, or they can contain a particle. In three 
dimensions, each cubic cell is divided into eight cubic subcells. Cells, as structures, have attributes like 
total mass, location of centre of mass and pointers to subcells. Particles, on the other hand have the usual 
attributes: position, velocity and mass. 

Force on a particle is computed by adding contribution of other particles or of cells. A cell that is 
sufficiently far away can be considered as a single entity and we can add the force due to the total mass 
contained in the cell from its centre of mass. If the cell is not sufficiently far away then we must consider its 
constituents, subcells and particles. Whether a cell can be accepted as a single entity for force calculation 
is decided by the cell acceptance criterion (CAC). We compute the ratio of the size of the cell d and the 
distance r from the particle in question to its centre of mass and compare it with a threshold value 

= -<9c (4) 
r 

The e rror in force increases with 9c. Poor choice of 9c can lead to significant errors jSalmon & Warreril 
Il994l) . Man y different approaches have been tried for the CAC in order t o minimize error as well as CPU 
time usage (ISalmon & Warrenl [l99l ISpringel. Yoshida. & WhiteLl200lh . The tree code gains over direct 
summation as the number of contributions to the force is much smaller than the number of particles. 

The TreePM method is characterised therefore by three para meters, rs,rcut and O c- For a discussion on 
the optimum choice of these parameters the reader is referred to (iBagla & Ravi 120031) . 

3 THE SCHEME OF GROUPS 

We first describe an optimization scheme due to (iBarnesl Il990l) . given in the paper with a curious title 
A modified tree code. Don't laugii, it runs. This scheme is easily portable to any N-body algorithm that 
uses tree data structures to compute forces. The origin of the optimisation is in the realisation that the tree 
walk used for computing forces is computationally the most expensive component of a tree code. The idea 
is to have a common interaction list for a group of particles that is sufficiently small. Given that we are 
working with a tree code, it is natural to identify a cell in the tree structure as a group. One can then add 
the contribution of particles within the group using direct pair summation. The cell acceptance criterion 
(CAC) for the tree walk needs to be modified in order to take the finite size of the group into account. In 
our implementation of the TreePM method, we modified the standard CAC in the following manner; 

d<{r- r„0 9c (5) 

where is the distance between the centre of mass of the group, and the group member that is farthest 
from the centre of mass. This is calculated once before the force calculation and does not add much in terms 
of overhead. 

The modified CAC can be thought of as the standard CAC with a distance dependent 9c, with the value 
of 9c decreasing at small r. As we require a larger number of operations for smaller 9c, each tree walk with 
the modified CAC is expected to require more CPU time than a tree walk with the standard CAC. However, 
as we do a tree walk for a group of particles in one go, CPU time is saved as the time taken for tree walk 
per particle comes down. 

There is an overhead as there is a pair-wise force calculation within the group. The cost of this overhead 
increases as the square of the number of particles in the group. In order to keep the overhead small, one 
would like the group to be sufficiently small compared to the size of the N-Body simulation and hence a 
maximum size Cmax and an upper bound on the number of particles in the group rip^^^ is used. An upper 
limit on the size of the group is pertinent because of the indirect effect through the change in the CAC. The 
effect of the additional parameter Cmax with the modified CAC will be seen when we discuss errors in sec- 
tion 4. Our implementation of the modified method by using a different definition of groups, with the addi- 
tional parameter Cmax and the modified CAC (eq.5) ensure that the short-range force is extremely accurate. 
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This i s different from previou s implementations dBarnesL [l990l: iMakinol [19911 lYoshikawa & Fukushigd. 
120051: IWadslevT Stadel. & Ouin n, 2004) where the group scheme was parametrized by just one parameter 
"^Pmax ™d the standard CAC (eq.4) used for tree traversal. We note in passing that the modified CAC is 
crucial in order to limit errors. Indeed, we find that working with the standard CAC leads to errors in short 
range force that are orders of magnitude larger 



3.1 Estimating Speedup 

We model the modified Tree/TreePM method with the aim of estimating the speedup that can be achieved. 
If N is the total number of particles, rip the typical number of particles in a group and Ug the number of 
groups then clearly we expect Ug x Up = N. The total time required for force calculation is a sum of the 
time taken up by the tree walk and the time taken up by pair wise calculation within the group. Actual 
calculation of the force, once the interaction list has been prepared takes very little time and can be ignored 
in this estimate, as can the time taken to construct the tree structure. The time taken is: 

, N 

Tg = ang In N + /dngUp ^ a — In iV + PNup (6) 

Up 

Here we have assumed that the time taken per tree walk scales as ©(In N) even with the modified CACQ. 
The time taken is smallest when 

; T„ . = 2/3iVnp 2a— IniV (7) 

/3 / fip 

Thus the optimum number of particles in the group scales weakly with the total number of particles. In the 
optimum situation, we expect the tree walk and the pair wise components to take the same amount of CPU 
time. 

For comparison, the time taken for force calculation in the standard TreePM is: 

T ^aN In (8) 

and we make the simplifying assumption that a is same for the two cases. The expected speed up is then 
given by: 

The speedup for the optimum configuration scales in the same manner as the typical number of particles 

per group. 

A more detailed analysis of this type can be found in (lMakinolfl99Tl) . 

The calculation we have presented above is approximate and ignores several factors, some of these have 
already been highlighted above. There are other subtleties like the role played by the finite range rcut over 
which the short range force is calculated. The size of a group (cmax) cannot be varied continuously, and 
hence Up is also restricted to a range of values. Further, the number of operations do not translate directly 
into CPU time as some calculations make optimal use of the capabilities of a CPU while others do not. For 
example, the pair wise calculation is likely to fare better on processors with a deep pipeline for execution 
whereas tree walk can not exploit this feature. The finite bandwidth of the CPU-memory connection also 
has an impact on the scaling with for large N. In the following section, we discuss the implementation 
of the modified TreePM method and the timing of the code with different values of parameters. 



' This is an approximation as we expect tlie tree walk to depend on Cmax, "pmax as well. The finite size of groups should 

lead to deviations from the C'(ln A'^) variation and the deviation should scale as the ratio of the volume of the group and the volume of 
the simulation box. As this ratio becomes smaller for large simulation boxes, we feel that the approximation we have made is justified. 
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Fig. 1 Time taken for computation of the short term force in the Modified TreePM method for 
an unclustered (left panel) and a clustered (right panel) distribution. Solid lines represent the 
time taken by a complete short-range force calculation. Dashed lines are the contribution to the 
force due to pairs within a group, the intra-group contribution. Dot-Dashed lines are the contri- 
butions to force due to tree walk. Purple, black and blue lines are for Cmax = 3.125, 6.25, 12.5 
respectively 



4 A MODIFIED TREEPM ALGORITHM WITH THE SCHEME OF GROUPS 

Tests of the TreePM method have shown that 95 — 98% of the time goes into the short-range force calcula- 
tion. Keeping this in mind the scheme of groups was introduced to optimize the short-range force calculation 
in terms of speed. A welcome featur e is more accurate fo rce computation. Since the optimum set of TreePM 
parameters have been discussed in dBagla & Ravi l2003h . we now look for the optimum choice of the ad- 
ditional parameters, Cmax and rip^^^, which describe the Modified TreePM algorithm. The analysis that 
follows is divided into two parts. First we look at the optimum values of Cmax and rip^^^^ which minimise 
the time for short range force computation. Second, we study errors in total and short range force with this 
new scheme. 

4.1 Optimum Parameters of The Modified TreePM Algoritlim 

We choose rs ~ 1, rcut ~ 5.2rs and Oc = 0.5 for the disc ussion that follows. With this choice the 
error in force for 99% of the particles is less than a few percent dBagla & Ravll20()3l) . We present analysis 
of performance of the modified TreePM for two different particle distributions taken from an iV— body 
simulation, with = N^^^ = 200^. 

- An unclustered distribution that corresponds to the initial conditions of an A^— body simulation. 

- A clustered distribution taken again from the same TV— body simulation. The scale of non-linearity for 
the clustered distribution is 8 grid lengths. 

We have verified that the nature of results does not change significantly for simulations with the number of 
particles ranging from 32"^ to 256'^. 
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Table 1 The following table lists the optimum values of (cmax , np^^^ ) for simulations of various 
sizes but having the same TreePM parameters; = 1, rcut = 5.2rs 9c = 0.5. 





„opt 
^max 




64 


4.0 


>1024 


128 


4.0 


>1024 


160 


5.0 


>1024 


200 


6.25 


>1000 



In figure 1 we show the time taken for computing the the short-range force (solid line) and determine the 
values of {Cmax , '^pmaT ) for which this timing is a minimum. Two leading contributions to the calculation 
of short range force are shown separately: 

- Intra-group particle-particle contribution (dashed line). 

- Time take for tree-walk and the related force calculation (dot-dashed line). 

Given that a group is a cell with maximum width 

Cmax = ^1^, (m is an integer), (10) 

Cmax can take therefore only discrete values. We choose to restrict upto Cmax ^ '^'Tcut as for larger cells, the 
dominant contribution to force on a given particle arises from the intra-group particle-particle interaction 
and the time taken for this is a sensitive function of the amplitude of clustering. 

The time taken for computing the short range force in both, unclustered (left panel) as well as clustered 
(right panel), distributions is qualitatively described by our model (see eqn.(6)). The pairwise force increases 
linearly with the maximum number of particles in a group and scales as rip where Up is 

the average number of particles in a group. The time taken for tree-walk decreases as n~^f^, reaches a 
minimum and then increases with rip^^^ (blue line) for the largest Cmax used here. For other values of Cmax 
we see the timing levelling off near the minimum. The scaling as rip^f^ is different from l/up we used 
in the analytical model and the reason for this is likely to be in the approximations we used. We find that 
the scaling approaches as we consider simulations with a larger number of particles. One crucial 

reason for different scaling is the modified CAC we use here. This effectively leads to a smaller 9c for cells 
closer to the group and the number of such cells increases with c,nax- 

In both cases the total time is still dominated by the tree-walk. The plateaus in the plots often indicate 
that the number of particles in a group of maximum size Cmax have saturated. At initial times where the 
fluctuations are small there is also a lower bound on the number of particles contained in a group. In the 
clustered distribution there is no such lower bound but an upper bound, larger than the corresponding upper 
bound in case of the unclustered distribution, exists and is dictated by the amplitude of clustering in the 
distribution of particles. 

From figure 1 we see that the optimum values of {Cmax, np^^^) = (12.5, 1024) & (6.25, > 1024) given 
by the minima of the solid blue line and the plateau of the solid black line respectively. In the latter case 
the time taken does not change for rip„„^ > 1024 and we consider this to be a useful feature that makes 
the Cmax = 6.25 a better choice as fine tuning of Up^^^ is not required. For the optimum {Cmax, "-pmax) 
one can see that force computation takes the same time for the clustered and the unclustered distributions. 
Table 1 lists optimum values for {Cm.ax, np^^^) for N-Body simulations with different numbers of particles. 
These numbers indicate that a good choice for Cmax is one which is closest to rcut, i-e. Cmax ^ fcut- The 
parameter rip^^^ can be taken to be 10'^ < np^^^ as we find little variation beyond this point. 

One can get an estimate of the overheads for the group scheme by looking at the limit of Up^^^ 1. 
Here we compare the performance of the TreePM with the modified code by plotting the time taken by the 
former as a large dot on the same panel where the time taken for the modified code is shown in the form of 
curves. The difference between these timings is around 0.1%. 
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Fig. 2 Time taken for short-range force calculation per particle per step for N ~ 32^ to N — 
256'^ for the TreePM (thick line) and the Modified TreePM (thin line). The solid line shows the 
performance of the codes on a single core if Intel 5160 (3.0 GHz) processor and the dashed line 
shows the performance on a single core of the AMD Barcelona (2. 1 GHz) processor 



The speedup for the optimal configuration of the modified TreePM, as compared with the base TreePM 
code is ~ 83. This is a huge gain and has to do with better utilisation of the CPU cache. The speedup is less 
impressive for smaller simulations, and is larger for bigger simulations. This is shown in figure 2 where we 
plot the time taken for force calculation per particle per step as a function of the total number of particles 
in the simulation. This is shown for the TreePM as well as the modified TreePM codes. Performance on 
two different types of processors is shown here to demonstrate that the optimisation works equally well on 
these. One can see that the TreePM code becomes (CPU-Memory) bandwidth limited for simulations with 
more than 64'^ particles and the time taken increases more rapidly than O(lniV). This does not happen in 
case of the modified TreePM where the scaling is C'(lniV) throughout. It is this difference that leads to 
impressive speedup for large simulations. For simulations with up to 64'^ particles we get a speedup by a 
factor of four. 



4.2 Errors in the Modified TreePM Force 



We now study errors in force for the modified TreePM force. Errors are calculated with respect to a reference 
force computed with very conservative values of TreePM parameters: 9c = 0.01, = 4.0, Vcut = 5.2rs. 
With these values the reference force is accurate to 0.1% (Bagl a & Rav,.2003,) . 

|frer-f| ^^^^ 



Iref 



Here e, fref , f are the relative error, reference force and the typical force in a simulation. We calculate errors 
for two distributions of particles: 
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Fig. 3 Distribution of errors in total force for different values of 9c with c,nax = 2.0 for un- 
clustered (top left panel) and clustered (top right panel) distributions. Dashed, dot-dash-dot- 
dash, dotted, dash-dot-dot-dot lines are for 9c = 0.1, 0.2, 0.3 and 0.5 respectively. We used 
rs = 1.0, Vcut = 5.2rs for these plots. The corresponding plots for c„iax = 4.0 are shown in 
lower left and lower right panels. 



- A uniform (unclustered) distribution. 

- A clustered distribution taken from an A^— body simulation. 

Both distributions have Nj^^ix = N = 128'^ particles. The exercise we follow is similar to dBagla & RayL 
l2003h but now we wish to highlight the effect of groups on errors in force. 

Figure 3 shows the distribution of errors for different values of 9c- The results are shown for both 
the distributions being studied here: the unclustered distribution (left panels) and the clustered distribution 
(right panels). The top row is with Cmax = 2.0 and the lower row is for Cmax = 4.0. We used Ts = 1.0 and 
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Fig. 4 Distribution of errors in short range force for different values of 9c for unclustered (left 
panel) and clustered (right panel) distributions. Dashed, dot-dash-dot-dash, dotted, dash-dot-dot- 
dot lines are for Oc = 0.1, 0.2, 0.3 and 0.5 respectively. Cmax ~ 4.0, rg = 1.0, Vcut = 5.2rs was 
used for these plots. 



r^ut ~ 5.2rs for this figure. In the case of the unclustered distribution error decreases with 9^ but saturates 
at 9c = 0.3 and does not decrease as 9c is decreased further The situation is different for the clustered 
distribution where the errors are not sensitive to 9c- This suggests that the errors are dominated by the long- 
range force. The unclustered distribution has larger errors than the clustered distribution. This is because 
the net force on each particle in the unclustered distribution is small, whereas force due to a cell with many 
particles is large and many such large contributions have to cancel out to give a small net force. Numerical 
errors in adding and subtracting these large numbers seem to systematically give a net large error Larger 
cells contribute for larger 9c hence the variation with 9c is more dramatic in the unclustered case. This effect 
is apparent in the discussion of the short-range force. With 9c ~ 0.3, 1% of particles have errors in total 
force greater than 4% in the unclustered case and 1.6% in clustered case. 

The effect of the modified CAC (eqn. (5)) is seen by comparing the plots of figure 3 for the unclustered 
distribution. The modified CAC is more stringent for larger value of Cmax and this is clearly seen in the 
error for 9c = 0.5. There is a lack of variation in errors with 9c for 9c < 0.5 indicating that at this stage 
the dominant contribution to errors is from the long range force calculation. The short-range force is more 
accurate with a larger Cmax due to two reasons: 

- The modified CAC has an r dependent opening angle threshold and requires a smaller 9c at small 
distances. This is likely to reduce errors. 

- The number of particles in a group is larger for larger Cmax- As the contribution of force from these 
particles is computed by a direct summation over pairs, the errors are negligible. 

One may raise the concern that the errors in the present approach are likely to depend on location of a 
particle within the group. We have checked for anisotropics in error in force calculation in groups that may 
result and we do not find any noteworthy anisotropic component. 

Next we look at the errors in short-range force for the same distributions (unclustered and clustered) of 
particles for various values of 9c- The reference short-range force was computed with 9c = 0.01, rg = 1.0, 
fcut = 5-2rs and Cmax = 4.0. We only varied 9c and continued to use rg = 1.0, rcut = 5.2rs, Cmax = 4.0 
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Fig. 5 Scaling of the time taken for short-range force calculation with 6c for the TreePM (left 
panel) and the modified TreePM (right panel). 

for computing the short-range force and then the errors. For the purpose of computing errors in the short 
range force, we cannot vary rg between the reference and the test model. 

The effect of decreasing Oc is more dramatic on errors in the short-range force. For the unclustered case 
the errors for 1% of the particles decreases by nearly 2.5 decades to 2 x 10^**% for 6c — 0.1. In the clustered 
case the errors for 1% of the particles decrease by nearly 1.5 decades to 3.4 x 10~^% for 6c = 0.1. One can 
obtain very high accuracy in short-range force by taking 6c = 0.2. As the short range force is the dominant 
one at small scales, the TreePM code can be used to follow the local dynamics fairly well by using a smaller 
dc- The impact of a small 6c on CPU time remains to be seen though. 

In figure 5 we look at how the CPU time for force calculation scales with 6c for the TreePM (left panel) 
and the modified TreePM (right panel). We compute the time taken for short-range force calculation per 
particle per timestep. We have seen in figure 1 and the corresponding discussion that clustering does not se- 
riously affect the performance of the TreePM code. We therefore do not repeat the exercice for distributions 
with different levels of clustering. We performed the short-range force timing on a clustered distribution 
taken from an A^— body simulation with N^c>x = N = 128^. We used = 1.0 and rent = 4.0 for the 
TreePM and the modified TreePM. In addition Cmax = 4.0, rip^^^ = 1024 were used for timing the mod- 
ified TreePM. When 6c is decreased from 0.5 to 0.2 the time for force computation per particle increases 
by 7.2% for the TreePM and 21% for the modified TreePM. The speedup of the modified TreePM over 
the TreePM when 6c is reduced from 0.5 to 0.2 decreases from 22.2 to 19.6, respectively. A nice feature 
of TreePM codes is that unlike tree codes, the CPU time taken by TreePM codes is less sensitive to 6^ 
Thus one can obtain much higher accuracy for the short range force with a TreePM code for a considerably 
smaller cost in terms of the CPU time. 

5 A HIERARCHY OF TIMESTEPS 

Due to the existence of a large range of dynamical time scales in a simulation of large scale structures, 
computing forces for slowly moving particles at every timestep is not required. It is better to integrate 

^ For example, the vaiiation in CPU time for a tree code increases by about 500% for the same change in 9c for a simulation with 
N f» 10*, and the increase in CPU time is larger for simulations with a larger number of particles iHernguisll 19871) . 
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Table 2 This table lists the time taken for a complete simulation run for the unoptimised TreePM, 
TreePM with hierarcical time steps, TreePM with the group scheme, and finally the TreePM with 
the group scheme as well as the hierarchical time steps and their speedup with respect to the base 
TreePM. 



Run 


Groups 


individual i'imesteps 


I'ime (sees) 


Speedup w.r.t Run 1 


1 


No 


No 


401983 


1.0 


2 


No 


Yes 


145240 


2.77 


i 


Yes 


No 


67639 


5.94 


4 


Yes 


Yes 


31612 


12.72 



the orbits of rapidly moving particles with a smaller timestep than those that move relatively slowly, this 
reduces the number of force calculations that are required. As force calculation is the most time consuming 
component of an N-Body code, this results in a significant reduction of the CP U time required . We have 
implemented a hierarchical time integrator similar to that used in GADGET-2 (ISpringell l2005h . in which 
particle trajectories are integrated with individual timesteps and synchronised with the largest timestep. 
As we allow the block time stef0 to vary with time, we work with the so called KDK approach (Kick- 
Drift-Kick) in which velocties are updated in two half steps whereas position is updated in a full step. It 
can be shown that with a variable time step, KDK performs better than DKD (Drift-Kick-Drift) (see the 
GADGET-2 paper (ISpringelLl2005l) for details.). In our implementation of the hierarchy of time steps, the 
smaller time steps differ by an integer power (n) of 2 from the largest, block time step. An array is then 
used to store the value n which determines the timestep of the particle. The code drifts all the particles 
with the smallest timestep to the next time, where a force computation is done for particles that require an 
updation of velocity (Kick). We have tested the robustness of the hie rarchical KDK integrator by succesfully 
integrating the 3— body problem discussed by Szebehely & Peters, (ISzebehely & Peterslll967h . 

Solving the equation of motion with a hierarchy of time steps can be combined with the group scheme. 
Since tree construction takes a small fraction of the total time, a new tree can be constructed whenever 
particles require an updation of velocity. The groups that contain such particles can then be identified and 
particles within each group can be reordered into two disjoint sets: ones that need an updation of velocity 
and others that don't. Force is computed only for particles in the first set. Since each group represents a very 
small fraction of the total number of particles, the overhead of reordering particles is negligible. 

Table 2 lists the time taken for a complete simulation run for the unoptimised TreePM, TreePM with 
hierarcical time steps, TreePM with the group scheme, and finally the TreePM with the group scheme as 
well as the hierarchical time steps and their speedup with respect to the base TreePM. The model used for 
this comparison is a power law model with n = —1.0, Nj^^^ = N = 64^. We used = 1.0, Vcut = 5.2rs, 
dc = 0.5 and e = 0.2 in all the runs. Here e is the softening length. We used Cmax = 4.0 and rip^^^ = 1024 
for the modified TreePM. 

We note that the hierarchical integrator gives a speedup of better than a factor 2, irrespective of whether 
the scheme of groups is used or not. The speedup is larger if the softening e is smaller, as the number of 
levels in the hierarchy increases with decreasing e. The scheme of groups on the other hand gives a speedup 
of 4 or better for small simulations, and a much larger speedup for bigger simulations. This speedup has 
little dependence on the TreePM parameters, i.e. Oc, Vg, Vcut- The combination of two optimisations gives 
us a speedup of 10 or more even for small simulations. 

6 DISCUSSION 

The scheme of groups when combined with a hierarchical integrator for the equation of motion guarantees 
a speedup of better than 10 for any A^— body code which uses tree structures for computing forces. From 



^ Same as the largest time step. 



12 



Khandai & Bagla 



the algorithmic point of view one does not expect a much larger speedup for larger N. However, as seen 
in Figure 2, the scheme also allows us to make better use of the cache on CPUs and the effective speedup 
can be ev en more impre ssive. We have demonstrated that memory overhead is negligible, and as was ob- 
served in jBarnesl Il990l) this optimisation just takes around 200 extra lines of code. A welcome feature 
is more accurate force computation than the code without this modification. This modification in principle 
introduces two additional parameters {c„iax, n-p^^^), but these are not independent and we have found that 
Cmax ~ Tcut and Up^^^ > 10'^ are good choices across a range of simulation sizes. 

Our analysis of the optimisation has been restricted to fixed resolution simulations. In case of zoom-in 
simulations the range of time scales is much larger and a more complex approach for combining the group 
scheme with the hierarchy of time steps may be required. Relative efficacy of the two optimisations may be 
very different in such a case when compared with the example studied in the previous section. 

In summary, we would like to point out that the scheme of groups leads to a significant optimisation of 
the TreePM method. The amount of CPU time saved is significant even for small simulations, but the cache 
optimisation aspect leads to even more significant gains for large simulations. We have shown in this paper, 
that it is possible to incorporate the scheme in a simple manner in any tree based code. The overall gain is 
very impressive as we are able to combine this with the use of a hierarchy of time steps. The possibility of 
combining the two optimisations has been explored in this work for the first time. 
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